Tag coexistence detection

ABSTRACT

In various embodiments, a method for optimizing data storage includes receiving an input data stream, where each data record received in the data stream is tagged with a group of one or more tags. The method further includes, for each data record of data records that have been received in the data stream, using the group of one or more tags of the corresponding data record to update a data structure tracking coexistence implications of tags that have been observed together in the groups of tags of the data records. The method further includes using the data structure tracking coexistence implications of tags to optimize a query.

RELATED APPLICATIONS

This application is a continuation application of and claims priority to and the benefit of co-pending U.S. application Ser. No. 15/601,942, filed on May 22, 2017, entitled “TAG COEXISTENCE DETECTION,” by Clement Ho Yan Pang, having Attorney Docket No. D758, and assigned to the assignee of the present application.

BACKGROUND OF THE INVENTION

Data management has become more challenging with the increasing popularity of cloud and on-premise products offering a variety of technological services to users. Conventional techniques for monitoring these systems are unable to effectively manage applications that generate large quantities of data. In one aspect, conventional techniques for managing these types of systems are typically slow and/or inefficient at handling queries regarding stored data. Additional challenges include limited space available to store large quantities of data, and, in some cases, conventional systems require additional space to store information to assist query handling.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a block diagram illustrating an embodiment of a system for tag coexistence detection.

FIG. 2 is a flow chart illustrating an embodiment of a process for tag coexistence detection.

FIG. 3 is a flow chart illustrating an embodiment of a process for tag coexistence detection including updating implications.

FIG. 4 is a flow chart illustrating an embodiment of a process for tag coexistence detection.

FIG. 5 is a block diagram illustrating an embodiment of a system for tag coexistence detection in a first state.

FIG. 6 is a block diagram illustrating an embodiment of a system for tag coexistence detection in a second state.

FIG. 7 is a functional diagram illustrating a programmed computer system for tag coexistence detection in accordance with some embodiments.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

A system and method of tag coexistence determination is disclosed. In various embodiments, a method includes receiving an input data stream, where each data record received in the data stream is tagged with a group of one or more tags. For each received data record, a group of one or more tags of the corresponding data record is used to update a data structure. The data structure is configured to track coexistence implications of tags that have been observed together in the groups of tags of the data records (also referred to as “implications database” or “implications table”). The method further includes using the data structure to optimize a query. By storing non-redundant tags in the data structure, the functioning of a computer or system of computers can be improved. For example, time-series data can be de-duplicated when storing in memory and/or in disk. The processes described herein allow space to be conserved and processing time to be improved.

FIG. 1 is a block diagram illustrating an embodiment of a system for tag coexistence detection. In the example shown, the system includes client 110, processor 120, disk 130, and main memory 140. The system may handle both discrete data and continuous data.

The client 110 is configured to receive input from and provide output to a user. For example, a user may make queries via client 110. In various embodiments, a query is a request about time series data, e.g., time series TS1 and TS2. The time series data may be stored in a storage system such as a time series database (TSDB) (disk 130). Time series data may be discrete or continuous. For example, the data may include live data fed to a discrete stream, e.g., for a standing query. Continuous sources may include analog output representing a value as a function of time. Continuous data may be time sensitive, e.g., reacting to a declared time at which a unit of stream processing is attempted, or a constant, e.g., a 5V signal. Discrete streams may be provided to processing operations (e.g., operations of processor 120) in timestamp order. In various embodiments, if data arrives out-of-order, streams may be rewound to process data in order.

Time series data may include groups of data points and have one or more associated tags. The time series data may be received periodically from various machines. The time series data may be analyzed to determine a status of the machine and/or source of the data. Using the example of a company that provides and monitors point of sale devices, several time series data streams may be collected by the point of sale devices.

A point of sale device may report one or more metrics for a data stream. Each reported metric may have any number of tags. Suppose a metric is system battery capacity and a host is described by an operating system of the point of sale device. Tags associated with this metric and host may include a serial number of the device, a merchant ID, an API URL, a platform, a target (e.g., U.S. product), a ROM version, an app version, and an app package. The tags associated with this metric may have implications. For example, a particular merchant may use the same app version for all of their hosts. In this scenario, the merchant ID tag implies the app version tag.

The point of sale device may report other types of data. For example, a first data stream may be temperature detected by a temperature sensor provided in a point of sale device. For example, temperature may be collected every five minutes throughout the day by the sensor. The collected data is stored as a first time series data stream and tagged with an identification of the point of sale device and geographical location of the point of sale device. A second data stream is processing capacity. For example, the percentage of CPU being used is collected every 30 seconds throughout the day and stored as a second time series data stream. The second time series data stream is tagged with characteristics of the hardware and operating system of the associated point of sale device and a geographical location. The first and second time series data may be analyzed individually or aggregated to provide analytics regarding the point of sale devices. For example, queries on the time series data may be executed to determine information such as trends in temperature or CPU usage in a particular time period. In particular, suppose one point of sale device appears sluggish for a fixed period of time each day. A query may be executed on time series data to determine the cause of the sluggishness. For example, the query may be for all point of sale devices with such a pattern of sluggishness. The result of the query may help to identify the cause of the problem such as a faulty software component causing light sensors to misbehave in certain lighting conditions.

In various embodiments, a checkpointing system may pause processing to wait for a next data point and resume processing from an earlier checkpoint to maintain a processing order. In various embodiments, processing is performed based on a wall clock that drives a stream graph to produce expected output for a particular time. The wall clock may be, but need not be, equivalent to real time. For example, when processing historical data or producing a live view of time series data over a time window, the wall clock may be used to drive output from an earlier time (typically a left-most side of a timeline), moving through time and collecting values from the output, then switching to a standing mode to await discrete tuples to arrive, as the wall clock advances in real-time.

The processor 120 is configured to process input time streams according to queries. In this example, time series TS1 and/or TS2 may be processed according to the query received via client 110. The processor 120 may be configured to perform one or more of the processes described herein, e.g., the processes shown in FIGS. 2-4. An example of a processor is processor 702 of FIG. 7.

In various embodiments, the processor 120 may include a parser, a compiler, and/or an executor. A parser may be configured to parse a query, e.g., translate a received query to a language understandable by a compiler. A compiler may be configured to produce a query execution plan. For example, a compiler may receive a parsed query and determine relevant time series data to retrieve. An executor may be configured to receive commands from the compiler and perform operations on the time series data. For example, an executor may be configured to fetch data, run data through processing operations, and determine a response to the query based on the execution plan. Processes performed by processor 120 may be made more efficient by selecting and storing non-redundant information in main memory 140 and/or disk 130, as further described herein.

The disk 130 is configured to store data. In various embodiments, disk 130 is a non-volatile memory configured to store time-series data and other types of data. In some embodiments, disk 130 is configured to store a more comprehensive set of data compared with main memory 140. The data stored in disk 130 typically takes longer to access compared with main memory 140 but may be more comprehensive. In various embodiments, disk 130 stores telemetry data 132 (referred to as a “telemetry data structure” or “telemetry table”). Telemetry data may include an association between a value and a customer, metric, host, timestamp, and one or more tags. In various embodiments, disk 130 stores index data (referred to as an “index data structure” or “index table”). The index table may include an association of a last reported value or time and a customer, one or more tags, metric, and host.

The main memory 140 is configured to store data. In various embodiments, main memory 140 is a volatile memory that stores data that is quickly retrievable in response to queries (e.g., main memory 140 behaves like a cache). In various embodiments, after accessing a first set of data in disk 130, a copy of the first set of data is stored in main memory 140 to facilitate faster retrieval for subsequent queries involving the first set of data. In various embodiments, main memory 140 stores implications data (referred to as “implications data structure” or “implications table”). In various embodiments, an implication between a first tag and a second tag is an “if and only if” relationship between the two tags. That is, a first tag is seen if and only if a second tag is seen and a second tag is seen if and only if a first tag is seen. An implication means a query of the first tag has a response that is identical to a query of the second tag.
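As a minimal illustration of the “if and only if” relationship, the following Python sketch checks whether two tags have appeared in exactly the same data records. The function name `tags_equivalent` and the representation of each record as a set of tag strings are assumptions made for this example and are not taken from the embodiments.

```python
from typing import Iterable, Set


def tags_equivalent(records: Iterable[Set[str]], first: str, second: str) -> bool:
    """Return True if `first` appears in a record exactly when `second` does."""
    for tags in records:
        # An "if and only if" relationship fails as soon as one tag
        # is observed in a record without the other.
        if (first in tags) != (second in tags):
            return False
    return True


records = [{"T1", "T2", "T3"}, {"T2", "T3", "T4", "T5"}]
print(tags_equivalent(records, "T2", "T3"))  # True: always seen together
print(tags_equivalent(records, "T1", "T2"))  # False: T2 was seen without T1
```

A brute-force scan like this is only meant to make the definition concrete; it does not reflect how an implications table would be maintained incrementally.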

In various embodiments, main memory 140 stores query planning data (referred to as a “query planning data structure” or “query planning table”). The query planning table may facilitate query planning by making data quickly and easily accessible in anticipation of queries. The query planning table may be dynamically updated according to the processes described herein, e.g., the process of FIG. 2.

Conventionally, what is stored in main memory 140 is a mirror image of what is stored in disk 130. Here, less memory can be used and processing speed can be improved by prudently selecting and storing the selected data in main memory. For example, by determining and storing the implications and the query planning table, a single non-redundant set of data may be stored.

In operation, the system shown in FIG. 1 receives a query via client 110. The compiler determines what to retrieve from time series databases (disk 130) based on the query. For example, the compiler determines how many scans to make on the time series databases. The compiler then hands off commands to an executor to perform an execution phase, e.g., beginning execution of the query. In various embodiments, processing speed is improved by making some data (e.g., a last reported value associated with a tag) available in a query planning table 144. Where data is available in the query planning table 144, the executor need not scan the time series databases to obtain the data in response to the query. Instead, the data can be retrieved quickly from the query planning table 144. In various embodiments, the executor determines an answer to the query by looking up one or more tags of the query in a query planning table. If the tags are not found in the query planning table, then the executor looks up the tag in disk 130 (e.g., in index data structure 134). The executor then outputs an answer to the query. Although shown as a single stream, the answer to the query may include one or more streams.

FIG. 2 is a flow chart illustrating an embodiment of a process for tag coexistence detection. The process of FIG. 2 may be at least in part implemented on one or more components of the system shown in FIG. 1. For example, processor 120 may be configured to perform the process with respect to telemetry data 132, index data 134, implications data 142, and query planning data 144. In some embodiments, process 200 is performed by processor 702 shown in FIG. 7.

At 202, an input data stream is received. The input data stream may include one or more data records. Each data record may include one or more identifiers. In various embodiments, the input data stream includes a string. For example, the input data stream may include identifiers encoding information about aspects of associated time series data. Example identifiers include a host, a platform, or a target, among other things. The host may include hardware information or operating system details. The platform may include module-specific information. The target may include product or region information.

At 204, one or more tags are identified for each data record in the input data stream. In various embodiments, a string included in the input data stream is converted to one or more identifiers. Each of the identifiers may be assigned to one of several IDs. Example IDs include a metric ID, a host ID, a tag, and the like. In some embodiments, all IDs are tags but may be given a specific name (e.g., “metric ID”) if they have a specific classification. Identifiers may vary from context to context and may be adapted to application needs.

At 206, one or more implications are identified based on the identified tag(s). Implications may be identified from the tags because they are seen together for a particular data record in the input data stream. For example, if tags T1, T2, and T3 are identified for an input stream, then three implications are determined therefrom: T1 implies T1, T2, and T3; T2 implies T1, T2, and T3; and T3 implies T1, T2, and T3. This means that a query for T1 yields the same results as a query for T2 or a query for T3. The identification of implications allows a single set of data to be stored instead of three redundant sets of data in this example of tags T1, T2, and T3.
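The derivation of implications from a single data record can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `implications_for_record` and the use of a dictionary of frozensets are assumptions, and the sketch covers only the case in which none of the tags has been seen before.

```python
from typing import Dict, FrozenSet, Iterable


def implications_for_record(tags: Iterable[str]) -> Dict[str, FrozenSet[str]]:
    """Map every tag in a record to the full group of tags seen with it."""
    group = frozenset(tags)
    # All tags share the same frozenset object, so the group is stored once.
    return {tag: group for tag in group}


implications = implications_for_record(["T1", "T2", "T3"])
for tag in sorted(implications):
    print(tag, "implies", sorted(implications[tag]))
```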

At 208, an implications data structure is updated based on the identified one or more implications. Referring to the example shown in FIG. 1, implications table 142 is updated. The implications table may be updated to indicate redundancies. Using the example of a set of implications in which T1 implies T1, T2, and T3; T2 implies T1, T2, and T3; and T3 implies T1, T2, and T3, the redundancies indicated are T1, T2, and T3. That is, T1, T2, and T3 can point to a same set of data for query planning purposes. An example of updating the implications table is further described herein with respect to FIG. 3.

At 210, a query planning data structure is updated including storing non-redundant metrics for tags based on the implications data structure. The query planning table may be updated based on the updated implications table (208). An updated query planning table facilitates efficient responses to queries. Referring to the example shown in FIG. 1, query planning table 144 stored in main memory is updated. In some embodiments, in the query planning data structure, each tag has a pointer to a data set. Based on implications information, one or more tags may point to the same data set. For example, if a first tag implies a second tag, the first tag and second tag point to the same data set.

The query planning data structure may optimize query handling by increasing a response speed and optimally utilizing memory compared with systems that do not have a query planning data structure or systems that have a less effective query planning data structure. A data set may be loaded from disk the first time a query associated with the data set is made. Once the data set is loaded to memory from disk, subsequent queries involving the data set may simply retrieve data set information from memory without needing to refer to disk. In one aspect, retrieving a query result from memory instead of disk saves time because there is a smaller space to search for the query response. In another aspect, more non-redundant information can be stored in the query planning data structure because redundant information is not unnecessarily taking up space in the query planning data structure.

Suppose a query planning table stores a single data set of tags T1, T2, and T3 (because T1 implies T1, T2, and T3). A first query is for a data set associated with a first tag, T1, which causes the data set to be loaded from disk into the query planning table. A subsequent query for a data set associated with a second tag, T2, causes the query to be handled with the query planning table without needing to retrieve the data set from disk. Despite a query for tag T2 having never been made before, the query can be quickly handled because a processor determines that T2 implies T1, T2, and T3 and is able to use the data set for tag T1, which has already been loaded into the query planning table.
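The T1/T2 scenario above can be sketched as a small in-memory cache. The class name `QueryPlanningTable`, the `load_from_disk` callback, and the stand-in `fake_disk_lookup` function are all hypothetical names introduced for illustration; the embodiments do not prescribe this interface.

```python
from typing import Callable, Dict, FrozenSet, List


class QueryPlanningTable:
    """In-memory cache in which equivalent tags share one data set (sketch)."""

    def __init__(self, implications: Dict[str, FrozenSet[str]],
                 load_from_disk: Callable[[str], List[float]]) -> None:
        self._implications = implications
        self._load_from_disk = load_from_disk
        self._data_sets: Dict[str, List[float]] = {}

    def query(self, tag: str) -> List[float]:
        if tag not in self._data_sets:
            data = self._load_from_disk(tag)  # slow path, taken once per group
            self._data_sets[tag] = data
            # Every tag implied by `tag` points to the same list object.
            for implied in self._implications.get(tag, frozenset()):
                self._data_sets[implied] = data
        return self._data_sets[tag]


disk_reads: List[str] = []


def fake_disk_lookup(tag: str) -> List[float]:
    # Hypothetical stand-in for a lookup in the index table on disk.
    disk_reads.append(tag)
    return [1.0, 2.0, 3.0]


group = frozenset({"T1", "T2", "T3"})
table = QueryPlanningTable({t: group for t in group}, fake_disk_lookup)
first = table.query("T1")   # loads the data set from "disk"
second = table.query("T2")  # served from memory, no second disk read
print(disk_reads)           # ['T1']
print(first is second)      # True
```

The data set is loaded lazily on the first query for any tag in the group, which matches the behavior described above without populating the table ahead of time.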

At 212, a telemetry data structure is updated including storing a non-redundant set of tags. Referring to the example shown in FIG. 1, telemetry table 132 stored in disk is updated. In various embodiments, one of the tags is selected for storage. The tag that is selected to be stored may be based on a variety of metrics to maximize efficiency and storage. In other words, the tag selected for storage may take the least amount of storage space. For example, a lexicographically simplest tag is stored. That is, redundant tags that are lexicographically more complex are not stored in favor of storing one representative tag, where the one representative tag is less complex. As another example, the tag that is stored has a smallest number of bits compared with other redundant tags.
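One possible selection rule is sketched below in Python. The function name `representative_tag` and the sample tag strings are hypothetical, and combining the two criteria mentioned above (smallest encoded size first, then lexicographic order as a tie-breaker) is an assumption made for this example; the embodiments present them as separate alternatives.

```python
from typing import Iterable


def representative_tag(redundant_tags: Iterable[str]) -> str:
    """Pick one tag to store on behalf of a group of redundant tags."""
    # Prefer the tag with the fewest encoded bytes; break ties lexicographically.
    return min(redundant_tags, key=lambda tag: (len(tag.encode("utf-8")), tag))


print(representative_tag(["merchant_id=42", "app_version=1.0.3", "rom=7"]))  # rom=7
print(representative_tag(["T3", "T1", "T2"]))  # T1
```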

At 214, an index table is updated. Referring to the example shown in FIG. 1, index table 134 stored in disk is updated. In some embodiments, an index table stores a copy of data for each tag. That is, each tag points to its own copy of data. In some embodiments, an index table stores non-redundant data such that tags having a common data set point to a single copy of the data set.

FIG. 3 is a flow chart illustrating an embodiment of a process for tag coexistence detection including updating implications. The process of FIG. 3 may be at least in part implemented on one or more components of the system shown in FIG. 1. For example, processor 120 may be configured to perform the process with respect to implications data 142. In some embodiments, at least a portion of the process of FIG. 3 is included in 208 of FIG. 2. In some embodiments, process 300 is performed by processor 702 shown in FIG. 7.

At 302, a tag is received. The tag may be identified from an input data stream. An example of tag identification is described with respect to 204 of FIG. 2.

At 304, it is determined whether the tag is new. In various embodiments, a tag is new if the tag has not been seen before. Suppose a first input data stream includes tags T1, T2, and T3 and a second input data stream includes tags T2, T3, T4, and T5. When processing the second input data stream tags, tags T4 and T5 are determined to be new because they have not been seen before. Tags T2 and T3 are determined to not be new because they were previously seen in the first input data stream. If the tag is new, the process proceeds to 306.

At 306, the tag and implications are stored as an added entry in the implications data structure. For example, if the implications are T1 implies T1, T2, and T3; T2 implies T1, T2, and T3; and T3 implies T1, T2, and T3, these three implications are stored in the implications data structure.

At 308, it is determined whether the entries in the implications data structure are consistent with the added entry. For example, a previously-stored entry in an implications table may be inconsistent with the added entry because the entry conflicts with the new tag and its implications.

If there are inconsistent entries, the process proceeds to 310 in which inconsistent entries in the implications data structure are corrected. In some embodiments, the implications table does not include any inconsistent entries. For example, the received new tag and its implications are consistent with previously stored entries in the implications table. If the entries in the implications data structure are determined to be consistent at 308 or the tag is determined to not be new at 304, the process ends.

FIG. 4 is a flow chart illustrating an embodiment of a process for tag coexistence detection. The process of FIG. 4 may be at least in part implemented on one or more components of the system shown in FIG. 1. For example, processor 120 may be configured to perform the process with respect to implications data 142. In some embodiments, at least a portion of the process of FIG. 4 is included in 308 and/or 310 of FIG. 3. In some embodiments, process 400 is performed by processor 702 shown in FIG. 7.

At 402, a list of tags is received. In various embodiments, the list of tags includes tags identified from an input data stream such as tags identified in 204 of FIG. 2.

404-410 may be performed for each tag in a record, e.g., for each tag in an entry in an implications table. At 404, it is determined whether a corresponding entry in the implications data structure is consistent with the tag and implications for the tag. Suppose a new implication is T2 implies T2, T3, T4, and T5. An entry in which T2 implies T1, T2, T3 is inconsistent with the new implication because T2 does not imply T1. To make the entry consistent with the new implication, the entry may be modified as further described herein.

If the corresponding entry in the implications table is consistent with the tag and implications for the tag, the corresponding entry remains in the implications data structure (406). For example, no changes are made to the entry.

If the corresponding entry in the implications table is inconsistent with the tag and implications for the tag, the corresponding entry is updated (408). For example, those implications that are no longer consistent with the added entry are modified or removed. Referring to the example in which a new implication is T2 implies T2, T3, T4, and T5 being compared with an entry in which T2 implies T1, T2, T3, the entry may be modified to: T2 implies T2, T3.

At 410, other entries in the implications table are updated to be consistent with the updated corresponding entry.
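One way to read steps 404-410 is as a set-intersection update over the implications table. The Python sketch below is an interpretation under stated assumptions, not the disclosed implementation: the function name `update_implications`, the dictionary-of-frozensets representation, and the treatment of tags absent from the record (removing the record's tags from their entries) are all choices made for this example.

```python
from typing import Dict, FrozenSet, Iterable

Implications = Dict[str, FrozenSet[str]]


def update_implications(table: Implications, record_tags: Iterable[str]) -> Implications:
    """Return a new implications table that is consistent with one more record."""
    record = frozenset(record_tags)
    new_tags = frozenset(tag for tag in record if tag not in table)
    updated: Implications = {}
    for tag, implied in table.items():
        if tag in record:
            # Keep only the implied tags that appeared again with this tag.
            updated[tag] = implied & record
        else:
            # This tag was absent, so it can no longer imply tags that were present.
            updated[tag] = implied - record
    for tag in new_tags:
        # New tags can only be equivalent to each other, because every
        # older tag has already been seen in a record without them.
        updated[tag] = new_tags
    return updated


table: Implications = {}
table = update_implications(table, ["T1", "T2", "T3"])
table = update_implications(table, ["T2", "T3", "T4", "T5"])
for tag in sorted(table):
    print(tag, "implies", sorted(table[tag]))
```

With these two records, the entry for T2 becomes T2 implies T2, T3, matching the example above. In this sketch an entry that implies only its own tag (T1 here) is kept as a trivial entry; since it carries no redundancy information, an implementation could equally remove it from the table.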

The processes shown in FIGS. 2-4 will now be explained using the examples of FIGS. 5 and 6. FIG. 5 is a block diagram illustrating an embodiment of a system for tag coexistence detection in a first state. The system includes a processor 520, disk 530, and main memory 540. An example of processor 520 is processor 120 of FIG. 1. An example of disk 530 is disk 130 of FIG. 1. An example of main memory 540 is memory 140 of FIG. 1.

In the first state, contents of telemetry table 532, index table 534, implications table 542, and query planning table 544 are as shown. FIG. 6 is a block diagram illustrating an embodiment of a system for tag coexistence detection in a second state. The system includes a processor 620, disk 630, and main memory 640. An example of processor 620 is processor 120 of FIG. 1. An example of disk 630 is disk 130 of FIG. 1. An example of main memory 640 is memory 140 of FIG. 1. In the second state, contents of telemetry table 632, index table 634, implications table 642, and query planning table 644 are as shown.

In the example shown in FIG. 5, each of the data structures (telemetry table 532, index table 534, implications table 542, and query planning table 544) starts out empty. An input data stream is received. In this example, three tags (T1, T2, T3) are identified from the input data stream. The three tags have an associated value, V1.

Implications table 542 may be updated based on the tags of the input data stream as follows. In this example, none of the three tags have been seen before. That is, each of the three tags is “new.” Because the tags are new, each tag and its implications are stored as an added entry in the implications table. Referring to implications table 542, an entry for tag T1 is “T1→T1, T2, T3,” an entry for tag T2 is “T2→T1, T2, T3,” and an entry for tag T3 is “T3→T1, T2, T3.” This means that T1 implies T1, T2, T3; T2 implies T1, T2, T3; and T3 implies T1, T2, T3. After the tags have been stored in the implications table, it may be determined whether other entries in the implications table are consistent with the added entries. Here, because the implications table was empty before T1, T2, and T3 were processed, all entries in the implications table are consistent with each other.

Query planning table 544 may be updated based on the updated implications table 542 as follows. An entry for each of the tags, T1, T2, and T3, may be created in the query planning table. Each of the entries points to a same data set, Data Set A, because the implications indicate that T1 implies T1, T2, T3; T2 implies T1, T2, T3; and T3 implies T1, T2, T3.

Telemetry table 532 may be updated based on the updated implications table 542 as follows. The received tags and corresponding values and metrics may be stored as an entry in the telemetry table. For example, value V1 is associated with a customer, metric, host, timestamp, and tag T1. In this example, only tag T1 is stored because T1 implies T1, T2, and T3.

Index table 534 may be updated by storing each of the tags with an associated data set. Here, each of the tags T1, T2, and T3 is stored with a respective copy of Data Set A. In an alternative embodiment (not shown), a single copy is stored. For example, a single copy of Data Set A is stored, and T1, T2, and T3 point to the same copy of Data Set A. This may reduce space needed to store data in the index table 534.

Referring now to FIG. 6, a second input data stream is received. Here, the second input data stream is received after the first input data stream (T1, T2, T3=V1) is received. In this example, the second input data stream has four tags (T2, T3, T4, and T5). The four tags have an associated value, V2. The implications based on the second input data stream are: T2 implies T3, T4, T5; T3 implies T2, T4, T5; T4 implies T2, T3, T5; and T5 implies T2, T3, T4. Each of the data structures (telemetry table 632, index table 634, implications table 642, and query planning table 644) may be updated in response to the second input data stream. For example, information reflecting the new implications for the second input data stream may be stored in a manner consistent with the implications for the first input data stream.

Implications table 642 may be updated based on the tags of the second input data stream as follows. In this example, tags T2 and T3 have been seen before (they are not new because they were seen in the first input data stream) and tags T4 and T5 are new. With respect to the new tags (T4, T5), the second input data stream indicates the following implications: T4 implies T4, T5; and T5 implies T4, T5.

As shown, implications for tags T4 and T5 (T4 implies T4, T5; and T5 implies T4, T5) are stored as added entries in the implications table 642. After the new implications are stored and before consistency with other entries is checked, the implications table (not shown) contains the following entries: T1 implies T1, T2, T3; T2 implies T1, T2, T3; T3 implies T1, T2, T3; T4 implies T4, T5; and T5 implies T4, T5. In this example, some previously-stored entries are inconsistent with the added entries. In particular, entry T1 implies T1, T2, T3 is inconsistent with implications of the second input data stream because T2 and T3 appear without T1 in the second input data stream. Thus, this entry is removed from the table. Entry T2 implies T1, T2, T3 is inconsistent with implications of the second input data stream because T2 and T3 appear together but without T1. This entry is updated to T2 implies T2, T3 as shown. Entry T3 implies T1, T2, T3 is inconsistent with implications of the second input data stream because T2 and T3 appear together but without T1. This entry is updated to T3 implies T2, T3 as shown. The updating of the implications table 642 to correct inaccuracies results in the implications table shown in FIG. 6.

Query planning table 644 may be updated based on the updated implications table 642 as follows. Query planning table 544 of FIG. 5 shows a first state of the table (before updates). Referring to implications table 642, T1 no longer implies T1, T2, T3. Thus, the pointers from T2 and T3 to the copy of Data Set A associated with T1 are removed. T2 is associated with its own copy of Data Set A and Data Set B. Because T2 implies T2, T3, T3 points to the copy of Data Set A and Data Set B associated with T2. Because T4 implies T4 and T5, T4 and T5 can both point to the same copy of Data Set C as shown. In various embodiments, query planning table data sets (Data Set A, Data Set B, Data Set C) may be loaded from index table 634 the first time a query is made for a tag. After this retrieval, the corresponding data set is stored in the query planning table and subsequent queries do not trigger a lookup in disk 630. This advantageously improves processing time without a slow initial start-up time, e.g., populating query planning table 644 with data sets before the first query is made.

Telemetry table 632 may be updated based on the updated implications table 642 as follows. Telemetry table 532 of FIG. 5 shows a first state of the table (before updates). For example, value V1 is associated with a customer, metric, host, timestamp, and tag T1. Based on the second input data stream, the first entry is updated to also associate tag T2 with customer, metric, host, timestamp and V1. A second entry is added to associate value V2 with a customer, metric, host, timestamp, and tags T2 and T4.

Index table 634 may be updated by storing each of the tags with an associated data set. Here, each of the tags T1, T2, and T3 is stored with a respective copy of Data Set A. In an alternative embodiment, a single copy is stored. For example, a single copy of each of Data Set A, Data Set B, and Data Set C is stored. T2 and T3 point to the same copy of Data Set A and Data Set B. This may reduce space needed to store data in the index table 634.

In various embodiments, one or more of the data structures described herein (e.g., telemetry table 132, implications table 142, query planning table 144, and index table 134) are implemented by a Bloom filter. The Bloom filter may be adjusted for missed pairings. The Bloom filter may be programmed to control a frequency of false positives. For example, a false positive rate (e.g., around 1%) may be specified when the Bloom filter is created. The Bloom filter may track seen tags and typically does not provide false negatives.
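As a non-limiting illustration of specifying a false positive rate at creation, the following sketch sizes a Bloom filter for tracking seen tags using the standard formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions, where n is the expected number of items and p is the target rate. The class BloomFilter, its parameters, and the use of SHA-256 with double hashing are assumptions made for this example and are not taken from the embodiments.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Hypothetical sketch of a Bloom filter that tracks seen tags."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01) -> None:
        if expected_items <= 0 or not 0.0 < false_positive_rate < 1.0:
            raise ValueError("expected_items must be > 0 and rate in (0, 1)")
        # Number of bits: m = -n * ln(p) / (ln 2)^2
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        )
        # Number of hash functions: k = (m / n) * ln 2
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self._bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return True for an item never added (false positive), but
        # never returns False for an item that was added.
        return all(
            self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

A tag that has been added is always reported as seen. A tag that has not been added is reported as seen only with a probability near the configured rate, provided the number of added tags does not exceed the expected number.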

FIG. 7 is a functional diagram illustrating a programmed computer system for tag coexistence detection in accordance with some embodiments. As will be apparent, other computer system architectures and configurations can be used to detect tag coexistence. Computer system 700, which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a processor or a central processing unit (CPU)) 702. For example, processor 702 can be implemented by a single-chip processor or by multiple processors. In some embodiments, processor 702 is a general purpose digital processor that controls the operation of the computer system 700. Using instructions retrieved from memory 710, the processor 702 controls the reception and manipulation of input data, and the output and display of data on output devices (e.g., display 718). In some embodiments, processor 702 includes and/or is used to execute/perform the processes described above with respect to FIGS. 2-4.

Processor 702 is coupled bi-directionally with memory 710, which can include a first primary storage, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM). As is well known in the art, primary storage can be used as a general storage area and as scratch-pad memory, and can also be used to store input data and processed data. Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 702. Also as is well known in the art, primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 702 to perform its functions (e.g., programmed instructions). For example, memory 710 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional. For example, processor 702 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).

A removable mass storage device 712 provides additional data storage capacity for the computer system 700, and is coupled either bi-directionally (read/write) or uni-directionally (read only) to processor 702. For example, storage 712 can also include computer-readable media such as magnetic tape, flash memory, PC-CARDS, portable mass storage devices, holographic storage devices, and other storage devices. A fixed mass storage 720 can also, for example, provide additional data storage capacity. The most common example of mass storage 720 is a hard disk drive. Mass storages 712, 720 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 702. It will be appreciated that the information retained within mass storages 712 and 720 can be incorporated, if needed, in standard fashion as part of memory 710 (e.g., RAM) as virtual memory.

In addition to providing processor 702 access to storage subsystems, bus 714 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 718, a network interface 716, a keyboard 704, and a pointing device 706, as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed. For example, the pointing device 706 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.

The network interface 716 allows processor 702 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown. For example, through the network interface 716, the processor 702 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps. Information, often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network. An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 702 can be used to connect the computer system 700 to an external network and transfer data according to standard protocols. For example, various process embodiments disclosed herein can be executed on processor 702, or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing. Additional mass storage devices (not shown) can also be connected to processor 702 through network interface 716.

An auxiliary I/O device interface (not shown) can be used in conjunction with computer system 700. The auxiliary I/O device interface can include general and customized interfaces that allow the processor 702 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.

In addition, various embodiments disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations. The computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of program code include both machine code, as produced, for example, by a compiler, and files containing higher level code (e.g., script) that can be executed using an interpreter.

The computer system shown in FIG. 7 is but an example of a computer system suitable for use with the various embodiments disclosed herein. Other computer systems suitable for such use can include additional or fewer subsystems. In addition, bus 714 is illustrative of any interconnection scheme serving to link the subsystems. Other computer architectures having different configurations of subsystems can also be utilized.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

1. A method for optimizing data storage, comprising: receiving a time series data stream comprising a plurality of data records, wherein each data record received in the time series data stream is tagged with a group of one or more tags; identifying one or more implications based on tags that have been observed together in the groups of one or more tags of the plurality of data records, wherein an implication indicates that a query of a first tag has a response that is identical to a query of a second tag; tracking coexistence implications of tags that have been observed together in the groups of one or more tags of the data records; and using the coexistence implications of tags to optimize a query.
2. The method of claim 1, wherein the tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records comprises: updating a query planning data structure.
3. The method of claim 1, wherein the tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records comprises: updating a query planning data structure to store non-redundant metrics for at least one tag of the groups of one or more tags.
4. The method of claim 1, wherein a query planning data structure is stored in a volatile memory.
5. The method of claim 1, wherein the tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records comprises: updating a telemetry data structure.
6. The method of claim 1, wherein the tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records comprises: updating a telemetry data structure to store a non-redundant association of a tag to metrics.
7. The method of claim 1, wherein the tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records comprises: updating an index data structure.
8. The method of claim 1, further comprising: updating a data structure tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the plurality of data records based on the one or more implications.
9. The method of claim 8, wherein the tracking the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records comprises: determining whether a tag of the group of one or more tags is new; responsive to the determination that the tag is new, storing the tag and associated implications as an added entry in the data structure; determining whether another entry in the data structure is consistent with the added entry; and responsive to the determination that the other entry is inconsistent with the added entry, correcting the inconsistent entry in the data structure.
10. The method of claim 8, further comprising: using the data structure to reduce a storage size of a stored version of at least one of the received data records.
11. The method of claim 8, wherein the data structure is stored in a non-volatile memory.
12. A system for optimizing data storage, comprising: a processor configured to: receive a time series data stream comprising a plurality of data records, wherein each data record received in the time series data stream is tagged with a group of one or more tags; identify one or more implications based on tags that have been observed together in the groups of one or more tags of the plurality of data records, wherein an implication indicates that a query of a first tag has a response that is identical to a query of a second tag; track coexistence implications of tags that have been observed together in the groups of one or more tags of the data records; and use the coexistence implications of tags to optimize a query; and a volatile memory coupled to the processor and configured to store the coexistence implications of tags that have been observed together in the groups of one or more tags of the data records.
13. The system of claim 12, wherein the volatile memory is further configured to store a query planning data structure, wherein the query planning data structure optimizes responses to queries.
14. The system of claim 13, wherein the processor is further configured to update the query planning data structure using the coexistence implications of the tags that have been observed together in the groups of one or more tags of the data records.
15. The system of claim 13, wherein the processor is further configured to update the query planning data structure to store non-redundant metrics for at least one tag of the groups of one or more tags using the coexistence implications of the tags that have been observed together in the groups of one or more tags of the data records.
16. The system of claim 12, further comprising a non-volatile memory coupled to the processor and configured to store a telemetry data structure, wherein the telemetry data structure is configured to store a non-redundant association of a tag to metrics.
17. The system of claim 16, wherein the processor is further configured to update the telemetry data structure to remove a redundant association of a tag to metrics using the coexistence implications of the tags that have been observed together in the groups of one or more tags of the data records.
18. The system of claim 12, further comprising a non-volatile memory coupled to the processor and configured to store an index data structure.
19. The system of claim 12, wherein the processor is further configured to reduce a storage size of a stored version of at least one of the received data records using the coexistence implications of the tags that have been observed together in the groups of one or more tags of the data records.
20. A computer program product for optimizing data storage, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for: receiving a time series data stream comprising a plurality of data records, wherein each data record received in the time series data stream is tagged with a group of one or more tags; identifying one or more implications based on tags that have been observed together in the groups of one or more tags of the plurality of data records, wherein an implication indicates that a query of a first tag has a response that is identical to a query of a second tag; tracking coexistence implications of tags that have been observed together in the groups of one or more tags of the data records; and using the coexistence implications of tags to optimize a query.